Abstract. This paper explores the theoretical basis of the covariance matrix
adaptation evolution strategy (CMA-ES) from the information geometry
viewpoint. 

To establish a theoretical foundation for the CMA-ES, we focus on a geo- 
metric structure of a Riemannian manifold of probability distributions equipped 
with the Fisher metric. We define a function on the manifold which is the ex- 
pectation of fitness over the sampling distribution, and regard the goal of 
update of the parameters of sampling distribution in the CMA-ES as maxi- 
mization of the expected fitness. We investigate the steepest ascent learning 
for the expected fitness maximization, where the steepest ascent direction is 
given by the natural gradient, which is the product of the inverse of the Fisher 
information matrix and the conventional gradient of the function. 

Our first result is that we can obtain under some types of parameterization 
of multivariate normal distribution the natural gradient of the expected fitness 
without the need for inversion of the Fisher information matrix. We find 
that the update of the distribution parameters in the CMA-ES is the same 
as natural gradient learning for expected fitness maximization. Our second 
result is that we derive the range of learning rates such that a step in the 
direction of the exact natural gradient improves the parameters in the expected 
fitness. We see from the close relation between the CMA-ES and natural 
gradient learning that the default setting of learning rates in the CMA-ES 
seems suitable in terms of monotone improvement in expected fitness. Then, 
we discuss the relation to the expectation-maximization framework and provide 
an information geometric interpretation of the CMA-ES. 
1. Introduction

The covariance matrix adaptation evolution strategy (CMA-ES; e.g., [14, 15]) is
the leading stochastic and derivative-free algorithm for solving continuous
optimization problems, i.e., for finding the optimizer $x^*$ of a real-valued
objective function $f$, also known as the fitness, defined on (a subset of)
$\mathbb{R}^d$, which we assume to be maximized without loss of generality. The
CMA-ES generates candidate points $\{x_i\}$, $i \in \{1, 2, \ldots, \lambda\}$,
from a multivariate normal distribution and evaluates their fitness values
$\{f(x_i)\}$. Then, it updates the mean vector and covariance matrix of the
multivariate normal distribution by using the information of the sampled points
and their fitness values, $\{(x_i, f(x_i))\}$. By repeating this
sampling-evaluation-update procedure, the CMA-ES moves the sampling distribution
toward promising areas and is expected to find a neighborhood of the optimizer.
At least, we do not expect it to converge to a non-stationary point of the
objective function [1].

The method used to improve the parameters of the sampling distribution strongly
determines the behavior and efficiency of the whole algorithm. The CMA-ES
updates the parameters so as to encourage the reproduction of previously
successful search steps. To do so, the CMA-ES, especially the rank-$\mu$ update
in the CMA-ES [14], is based on maximum-likelihood estimation. Hence, the CMA-ES
can be considered to be based on a statistical principle.

Recently, Wierstra et al. [28] proposed a novel algorithm named natural
evolution strategy (NES), which was subsequently developed further by Sun et al.
[25, 26] and Glasmachers et al. [12]. In NESs, the objective of the parameter
update is considered to be maximization of the expected fitness $E[f(x)]$, where
the expectation is taken under the current sampling distribution, and a natural
gradient [5] based approach is employed. Thus, NESs are considered to be derived
from a principle of information geometry and, from their nature, constitute a
more principled approach than the CMA-ES.

This paper addresses the theoretical justification for the CMA-ES from the
information geometry viewpoint and gives a mathematical interpretation of the
CMA-ES. For this purpose, we consider a geometric structure of a Riemannian
manifold of probability distributions equipped with the Fisher metric, and
define an alternate maximization problem on the manifold: the objective function
is the expectation $E[f(x) \mid \theta]$ of the fitness function, where the
expectation is taken under the normal distribution parameterized by $\theta$,
and the arguments are the parameters $\theta$ of the normal distribution. Then,
we investigate natural gradient learning, i.e., steepest ascent learning on the
manifold, for the expected fitness maximization. This idea is directly inspired
by the formulation of NESs.

The first result of this paper is an analogy between the CMA-ES and natural 
gradient learning for expected fitness maximization. We show that the natural 
gradient, which is given by the product of the inverse of the Fisher information 
matrix of the normal distribution and the conventional gradient, can be directly 
estimated without calculation of the Fisher information matrix and its inverse under 
some particular parameterization of the normal distribution. Then, we see that the 
natural gradient learning for maximizing the expected fitness where the natural 
gradient is estimated from the samples in a particular parameterization, has the 
same form of parameter update as the CMA-ES. This part of the paper is the 
extension of our previous study [2].

The second part of this article deals with the learning rate parameter. The
natural gradient view of the CMA-ES gives us an insight into the learning rate: 
the learning rate does not only possess an effect of reducing fluctuation of the 
parameters due to the variance of the natural gradient estimate, but also takes 
control of the step-size along with the natural gradient. In a general scheme of 
gradient-based learning, scheduling of the learning rate is an important factor in 
determining the speed and accuracy of convergence and the optimal learning rate 
varies with the function and the position of the parameter [5]. However, the learn- 
ing rates in the CMA-ES are usually fixed during learning and they are different 
for the mean vector and for the covariance matrix. Here, an interesting question 
arises as to why the CMA-ES performs well with constant learning rates within 
(0, 1] that are different for each parameter. To confirm the validity of this set- 
ting, we derive the range of learning rates which guarantee that a step along the 
exact natural gradient improves the expected fitness value. Then, we discuss the 
similarity to the fitness expectation-maximization algorithm [27] which is based 
on expectation-maximization (EM; [10]) framework, and provide an information 
geometric interpretation of the CMA-ES as natural gradient learning for expected 
fitness maximization. 

The rest of this paper is organized as follows: Section 2 introduces the CMA-ES. 
Section 3 introduces the framework of natural gradient learning for expected fitness 
maximization. Section 4 derives the form of the natural gradient estimate and shows 
that the CMA-ES and natural gradient learning for expected fitness maximization 
have the same form of parameter update and that we can describe the CMA-ES 
and NESs using the same framework. Section 5 provides the range of learning 
rates so that exact natural gradient learning leads to monotone improvement in the 
expected fitness, followed by a discussion about the learning rates in the CMA-ES. 
We discuss the relation to the EM-inspired algorithm [27] and the correspondence 
to the framework of generalized EM (GEM) algorithms [10]. We conclude with a 
summary in Section 6. 

2. Covariance Matrix Adaptation Evolution Strategy 

Let $\pi(x; m, \sigma^2 C)$ represent the probability density function of the
multivariate normal distribution with mean vector $m$ and covariance matrix
$\sigma^2 C$. Here $\sigma$ is a scalar, which we call the global step-size in
the context of the CMA-ES. The CMA-ES [ ] repeats the following steps after
choosing the initial parameters $m^0$, $\sigma^0$, and $C^0$ and setting
$p_\sigma^0 = 0$ and $p_C^0 = 0$.

(1) Sample $\lambda$ independent points $x_1, \ldots, x_\lambda$ from $\pi(x; m^t, (\sigma^t)^2 C^t)$.

(2) Evaluate the fitness values $f(x_1), \ldots, f(x_\lambda)$.

(3) Update the parameters as follows.

Mean vector:

    $m^{t+1} = \sum_{i=1}^{\lambda} w_{R_i} x_i$,

where $R_i$ represents the ranking of $f(x_i)$, i.e., $x_i$ has the $R_i$th
highest fitness value among $f(x_1), \ldots, f(x_\lambda)$; and $w_{R_i}$
represents the weight for the $R_i$th highest point and has the following
properties: $0 \le w_i \le w_j \le 1$ for any $i > j$ and
$\sum_{i=1}^{\lambda} w_i = 1$.

Global step-size:

    $\sigma^{t+1} = \sigma^t \exp\left( \frac{c_\sigma}{d_\sigma} \left( \frac{\| p_\sigma^{t+1} \|}{\chi_d} - 1 \right) \right)$,

where $c_\sigma$ and $d_\sigma$ are the learning rate and the damping parameter,
respectively; $\chi_d$ denotes the expectation of the chi distribution with $d$
degrees of freedom; $p_\sigma$ is an evolution path that is updated as

    $p_\sigma^{t+1} = (1 - c_\sigma) p_\sigma^t + \sqrt{\frac{c_\sigma (2 - c_\sigma)}{\sum_{i=1}^{\lambda} w_i^2}} \, (C^t)^{-1/2} \, \frac{m^{t+1} - m^t}{\sigma^t}$.

Covariance matrix:

    $C^{t+1} = (1 - c_1 - c_\mu) C^t + c_1 p_C^{t+1} (p_C^{t+1})^T + c_\mu \sum_{i=1}^{\lambda} w_{R_i} \frac{x_i - m^t}{\sigma^t} \left( \frac{x_i - m^t}{\sigma^t} \right)^T$,

where $c_1$ and $c_\mu$ are learning rate parameters and $p_C$ is an evolution
path that is updated as

    $p_C^{t+1} = (1 - c_c) p_C^t + \sqrt{\frac{c_c (2 - c_c)}{\sum_{i=1}^{\lambda} w_i^2}} \, \frac{m^{t+1} - m^t}{\sigma^t}$.

Here $c_c$ is the learning rate for the evolution path update.

The parameter adaptation in the CMA-ES is based on two principles. The first
one is maximum likelihood estimation (MLE). The update rule for $m$ and the
third term of the covariance matrix adaptation, called the rank-$\mu$ update,
can be interpreted as MLE. They are adapted so as to increase a weighted
log-likelihood of previous samples, where points with higher fitness values have
greater weights. The second one is the accumulation of successful steps. The
step-size adaptation and the second term of the covariance matrix adaptation,
called the rank-one update, rely on the paths $p_\sigma$ and $p_C$. They are
called evolution paths. Evolution paths contain information about the
correlation between successive successful steps. Although evolution paths are
reported to be unstable when $\lambda$ is large [3, 14], they have a large
effect on search speed and accuracy when $\lambda$ is small.

In what follows, we investigate a simplified CMA-ES called the rank-$\mu$ only
CMA-ES, in which the global step-size and evolution paths are removed. The
resulting update rules reduce to

(1)  $m^{t+1} = m^t + \eta_m \sum_{i=1}^{\lambda} w_{R_i} (x_i - m^t)$

(2)  $C^{t+1} = C^t + \eta_C \sum_{i=1}^{\lambda} w_{R_i} \left( (x_i - m^t)(x_i - m^t)^T - C^t \right)$,

where $\eta_m$ and $\eta_C$ are learning rate parameters.
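The update rules (1) and (2) can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `rank_mu_update` and its argument layout are assumptions made here, and the weights are simply taken as a non-increasing array whose first entry belongs to the best sample.

```python
import numpy as np

def rank_mu_update(m, C, xs, fitness, weights, eta_m, eta_C):
    """One rank-mu-only update of (m, C), following rules (1) and (2).

    xs: (lam, d) samples; fitness: (lam,) values to be maximized;
    weights: (lam,) non-increasing, weights[0] belongs to the best sample.
    """
    # ranks[i] = 0 for the sample with the highest fitness, 1 for the next, ...
    order = np.argsort(-fitness)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(fitness))
    w = weights[ranks]                       # w_{R_i} for each sample i

    diff = xs - m                            # rows are x_i - m^t
    m_new = m + eta_m * (w @ diff)           # rule (1)
    # sum_i w_{R_i} (x_i - m^t)(x_i - m^t)^T
    outer = np.einsum("i,ij,ik->jk", w, diff, diff)
    C_new = C + eta_C * (outer - w.sum() * C)  # rule (2)
    return m_new, C_new
```

With $\eta_m = \eta_C = 1$ and weights summing to one, the new mean is the weighted mean of the samples and the new covariance is the weighted scatter of the samples around the old mean.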

3. Natural Gradient Learning for Expected Fitness Maximization 

In this section, we introduce natural gradient learning for expected fitness
maximization. We start with the definition of statistical manifolds and the
concept behind the natural gradient. Then we formulate the expected fitness
maximization problem and the framework of natural gradient learning.

Statistical Manifold. Information geometry [7] is the study of the natural
differentiable geometric structure of manifolds of probability distributions.
Consider a family $S$ of probability distributions on $\mathbb{R}^d$
parameterized using $n$ real-valued variables $\theta = [\theta_1, \ldots, \theta_n]$
so that $S = \{ p_\theta = p(x; \theta) \mid \theta \in \Theta \}$, where
$\Theta$ is a subset of $\mathbb{R}^n$ and the mapping $\theta \mapsto p_\theta$
is an injection. Such a set $S$ is called an $n$-dimensional statistical model
on $\mathbb{R}^d$. The mapping $\varphi : S \to \mathbb{R}^n$ defined by
$\varphi(p_\theta) = \theta$ is viewed as a coordinate system for $S$. With a
Riemannian metric, termed the Fisher metric, defined by the Fisher information
matrix

(3)  $F(\theta) = \int \nabla \ln p(x; \theta) \, (\nabla \ln p(x; \theta))^T \, p(x; \theta) \, dx$,

we can consider $S$ as a Riemannian manifold and then we call $S$ a statistical
manifold.

It is possible to define an infinite number of Riemannian metrics on S. However, 
we find that there are properties that distinguish the Fisher metric from other 
metrics. One good property is that the Fisher metric is the only invariant metric 
under the choice of coordinate system [7, Section 2.4]. The invariance is important 
in order to consider the intrinsic geometric structure of manifolds. The fact that the 
Fisher information matrix is the curvature of the KL-divergence [20, Section 2.6] is
also a supportive property because the KL-divergence is commonly used to measure 
the difference between two probability distributions. Hence, the Fisher metric is 
considered as the most natural Riemannian metric on statistical manifolds. 

Natural Gradient. Consider $\rho$ as a function defined on a Riemannian manifold
$S$ equipped with a Riemannian metric $G$, with coordinate system
$\varphi : p_\theta \mapsto \theta$. Let $\rho_\varphi(\theta) = \rho(\varphi^{-1}(\theta))$.
On the Riemannian manifold $S$, the steepest ascent direction of $\rho_\varphi$
is not usually given by the conventional gradient direction
$\nabla \rho_\varphi(\theta)$. The natural gradient [5]

(4)  $\tilde{\nabla} \rho_\varphi(\theta) = G^{-1}(\theta) \nabla \rho_\varphi(\theta)$

gives the steepest ascent direction of $\rho_\varphi$ on $(S, G)$ and it is
invariant under the choice of coordinate system. Natural gradient learning has
been used as an efficient learning algorithm in several fields of machine
learning [5, 6, 23].

Expected Fitness. Let $\pi(x; \theta) = \pi(x; m(\theta), C(\theta))$ and
$\Theta$ be a set of $\theta$ where $C(\theta)$ is nonsingular. Then, the
expected fitness with respect to $\pi(x; \theta)$ is defined as

(5)  $J(\theta) = E[f(x); \theta] = \int f(x) \pi(x; \theta) \, dx$.

The function $J(\cdot)$ can be considered as a function on a statistical manifold.

Natural Gradient Learning for Expected Fitness Maximization. Since the metric
$G(\theta)$ on a statistical manifold is given by the Fisher information matrix
$F(\theta)$, the steepest ascent direction can be given by the natural gradient
$\tilde{\nabla} J(\theta) = F^{-1}(\theta) \nabla J(\theta)$. For the case of
normal distributions, the $(i, j)$th element of the Fisher information matrix
has a well-known explicit form [18, p. 47 and Appendix 3C]

(6)  $F_{ij} = \frac{\partial m^T}{\partial \theta_i} C^{-1} \frac{\partial m}{\partial \theta_j} + \frac{1}{2} \mathrm{tr}\left( C^{-1} \frac{\partial C}{\partial \theta_i} C^{-1} \frac{\partial C}{\partial \theta_j} \right)$.

The gradient can be expressed as

     $\nabla J(\theta) = \nabla \int f(x) \pi(x; \theta) \, dx = \int f(x) \nabla \pi(x; \theta) \, dx$

(7)  $\quad = \int f(x) \pi(x; \theta) \nabla \ln \pi(x; \theta) \, dx$,

where the second equality holds under some regularity conditions which are
derived from Lebesgue's dominated convergence theorem (see e.g. [8, Theorem 16.3]).
Therefore, the natural gradient is expressed as

(8)  $\tilde{\nabla} J(\theta) = \int f(x) F^{-1}(\theta) \nabla \ln \pi(x; \theta) \, \pi(x; \theta) \, dx$.

Since the fitness function is unknown, so are the expected fitness and its
natural gradient. We estimate the natural gradient by the Monte-Carlo
approximation:

(9)  $\delta(\theta \mid \{x_i\}) = \sum_{i=1}^{\lambda} \frac{f(x_i)}{\lambda} F^{-1}(\theta) \nabla \ln \pi(x_i; \theta)$.

Here, we can calculate the inverse of the Fisher information matrix (6) not
necessarily analytically but numerically. Using the estimate
$\delta(\theta \mid \{x_i\})$, natural gradient learning for expected fitness
maximization adjusts the parameter by the following rule:
$\theta^{t+1} = \theta^t + \eta \, \delta(\theta^t \mid \{x_i\})$.

Natural Evolution Strategies. NESs adjust the parameters on the basis of the
natural gradient of the expected fitness, but they non-linearly transform the
fitness function. In the Monte-Carlo approximation of the natural gradient (9),
NESs replace $f(x_i)/\lambda$ with a ranking-based weight $w_{R_i}$. We call
this transformation ranking-based fitness shaping. The fitness shaping gives
NESs invariance under order-preserving, i.e., monotone, transformations of the
fitness function, as in the CMA-ES.
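The invariance obtained from ranking-based fitness shaping can be illustrated with a short sketch. The helper name `ranking_weights`, the weight values, and the fitness values below are all hypothetical and chosen only for illustration; the point is that the shaped weights depend on the order of the fitness values, not on their magnitudes.

```python
import numpy as np

def ranking_weights(fitness, weights):
    """Map raw fitness values to w_{R_i}: the best sample receives weights[0]."""
    order = np.argsort(-np.asarray(fitness, dtype=float))
    shaped = np.empty(len(order))
    shaped[order] = weights
    return shaped

weights = np.array([0.6, 0.3, 0.1])          # hypothetical, non-increasing, sums to one
f = np.array([-2.0, 5.0, 0.5])               # hypothetical raw fitness values
w_raw = ranking_weights(f, weights)
w_exp = ranking_weights(np.exp(f), weights)  # exp is a strictly increasing transformation
```

Both calls assign the same weight to each sample, so any update built from these weights is unchanged when the fitness is replaced by a strictly increasing function of itself.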



4. Analogy of the CMA-ES to Natural Gradient Learning 

This section discusses the analogy between the CMA-ES and natural gradient 
learning, which follows from the derivation of the explicit form of the natural gradi- 
ent on the expected fitness. At the end of the section, we remark on some variants 
of the CMA-ES. 



4.1. General Form of the Natural Gradient. Let $\Theta$ be a set of parameters
$\theta$ such that the normal distribution $\pi(x; \theta)$ is nonsingular;
i.e., the Fisher information matrix $F(\theta)$ is nonsingular. We suppose that
the parameter vector is divided into two parts $[\theta_m^T, \theta_C^T]^T$, and

(10)  $\frac{\partial m}{\partial \theta_C^T} = 0$  and  $\frac{\partial \mathrm{vech}(C)}{\partial \theta_m^T} = 0$

hold at $\theta \in \Theta$, where vech denotes the half-vectorization operator
that maps a $d$-dimensional square matrix to a $d(d+1)/2$-dimensional column
vector that stacks columns starting at the diagonal elements of the matrix (see
e.g., [16, Chapter 16]). The assumption (10) is satisfied if $m$ and $C$ only
depend on $\theta_m$ and $\theta_C$, respectively, which is satisfied in the
cases that we treat in the later sections. Then, the Fisher information matrix
has the block form $F(\theta) = \mathrm{diag}(F_m(\theta), F_C(\theta))$ and we
have from (9)

(11)  $\delta(\theta \mid \{x_i\}) = \sum_{i=1}^{\lambda} \frac{f(x_i)}{\lambda} \begin{bmatrix} F_m^{-1}(\theta) \nabla_{\theta_m} \ln \pi(x_i; \theta) \\ F_C^{-1}(\theta) \nabla_{\theta_C} \ln \pi(x_i; \theta) \end{bmatrix}$.



Thus, we have the explicit form of the estimate of the natural gradient at 9 if we 
can analytically evaluate each block of the right-hand side of (11). However, it is 
not trivial to calculate the inverse of the Fisher information matrix and express it 
in terms of m and C. 

The following theorem shows that we can directly obtain the product of the 
inverse of the Fisher information matrix and the gradient of the log-likelihood 
without inversion of the Fisher information matrix. 

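Theorem 4.1. Suppose $\theta_m$ and $\theta_C$ are $d$- and
$d(d+1)/2$-dimensional column vectors, respectively. Then
$\partial m / \partial \theta_m^T$ and
$\partial \mathrm{vech}(C) / \partial \theta_C^T$ are invertible at
$\theta \in \Theta$, and

(12)  $F_m^{-1}(\theta) \nabla_{\theta_m} \ln \pi(x \mid \theta) = \left( \frac{\partial m}{\partial \theta_m^T} \right)^{-1} (x - m)$

(13)  $F_C^{-1}(\theta) \nabla_{\theta_C} \ln \pi(x \mid \theta) = \left( \frac{\partial \mathrm{vech}(C)}{\partial \theta_C^T} \right)^{-1} \mathrm{vech}\left( (x - m)(x - m)^T - C \right)$.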

Theorem 4.1 shows that if the derivatives of the mean vector and the covariance
matrix with respect to $\theta_m$ and $\theta_C$ have simple forms and their
inverse matrices can be easily expressed in terms of $m$ and $C$, then we can
obtain the form of the natural gradient (8) and the estimate
$\delta(\theta \mid \{x_i\})$ analytically by using (12) and (13). In most
cases, the additional inversion is easier to perform than the inversion of the
Fisher information matrix.

It is worth mentioning that the natural gradient can be also derived by the way 
taken by Glasmachers et al. [12]. To avoid the computation of the Fisher informa- 
tion matrix, they introduce a local coordinate on S where the Fisher information 
matrix is identical to the unit matrix. They show the statement of the natural 
gradient under exponential parameterization described in Section 4.3. 
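As a numerical sanity check of (13), the following sketch builds the covariance block of the Fisher information matrix from the trace form in (6), inverts it explicitly, and compares the result with the closed form. It assumes the simplest parameterization $\theta_C = \mathrm{vech}(C)$, so that the Jacobian in (13) is the identity; the helper names `vech`, `basis`, and `natural_gradient_C` are introduced here for illustration only.

```python
import numpy as np

def vech(A):
    """Stack the lower-triangular part of A column by column."""
    d = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(d)])

def basis(d):
    """Symmetric matrices dC/dtheta_k for theta_C = vech(C), in vech order."""
    mats = []
    for j in range(d):
        for i in range(j, d):
            E = np.zeros((d, d))
            E[i, j] = E[j, i] = 1.0
            mats.append(E)
    return mats

def natural_gradient_C(x, m, C):
    """F_C^{-1} grad_{theta_C} ln pi(x; theta), computed by explicit inversion."""
    Cinv = np.linalg.inv(C)
    Es = basis(len(m))
    S = np.outer(x - m, x - m) - C
    # Covariance block of the Fisher matrix: 0.5 * tr(C^-1 E_i C^-1 E_j).
    F = np.array([[0.5 * np.trace(Cinv @ Ei @ Cinv @ Ej) for Ej in Es]
                  for Ei in Es])
    # Score with respect to theta_C: 0.5 * tr(C^-1 E_i C^-1 S).
    g = np.array([0.5 * np.trace(Cinv @ Ei @ Cinv @ S) for Ei in Es])
    return np.linalg.solve(F, g)
```

For any positive definite `C`, `natural_gradient_C(x, m, C)` should agree with `vech(np.outer(x - m, x - m) - C)` up to rounding, which is the statement of (13) in this parameterization.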

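Proof. First, we derive the inverse matrix of each block of the Fisher
information matrix. From (6) and assumption (10) we have the block of the Fisher
information matrix corresponding to $\theta_m$

(14)  $F_m = \left( \frac{\partial m}{\partial \theta_m^T} \right)^T C^{-1} \frac{\partial m}{\partial \theta_m^T}$.

Since $\partial m / \partial \theta_m^T$ is a $d$-dimensional square matrix, it
must be invertible if $F_m$ is invertible. Since $F$ is nonsingular at
$\theta \in \Theta$, $F_m$ is invertible. Thus, $\partial m / \partial \theta_m^T$
is invertible. Then, the inverse matrix of $F_m$ is expressed as

(15)  $F_m^{-1} = \left( \frac{\partial m}{\partial \theta_m^T} \right)^{-1} C \left( \left( \frac{\partial m}{\partial \theta_m^T} \right)^T \right)^{-1}$.

From (6), assumption (10), and the formula of matrix differentiation (see e.g., [16,
Chapter 15])

(16)  $\frac{\partial C^{-1}}{\partial \theta_i} = -C^{-1} \frac{\partial C}{\partial \theta_i} C^{-1}$,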

AKIMOTO ET AL.: IN ALGORITHMICA, ONLINE FIRST (2011) 9 

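we have the $(i, j)$th element of the block of the Fisher information matrix
corresponding to $\theta_C$ as

    $(F_C)_{ij} = \frac{1}{2} \mathrm{tr}\left( C^{-1} \frac{\partial C}{\partial \theta_{C,i}} C^{-1} \frac{\partial C}{\partial \theta_{C,j}} \right) = -\frac{1}{2} \mathrm{tr}\left( \frac{\partial C^{-1}}{\partial \theta_{C,i}} \frac{\partial C}{\partial \theta_{C,j}} \right)$

    $\quad = -\frac{1}{2} \mathrm{vech}\left( 2 \frac{\partial C^{-1}}{\partial \theta_{C,i}} - \mathrm{diag}\left( \frac{\partial C^{-1}}{\partial \theta_{C,i}} \right) \right)^T \mathrm{vech}\left( \frac{\partial C}{\partial \theta_{C,j}} \right)$

    $\quad = -\frac{1}{2} \left( \frac{\partial \mathrm{vech}(2 C^{-1} - \mathrm{diag}(C^{-1}))}{\partial \theta_{C,i}} \right)^T \frac{\partial \mathrm{vech}(C)}{\partial \theta_{C,j}}$,

where $\mathrm{diag}(C)$ represents a diagonal matrix whose diagonal elements
equal the diagonal elements of $C$. Then, we have the matrix form

(17)  $F_C = -\frac{1}{2} \left( \frac{\partial \mathrm{vech}(2 C^{-1} - \mathrm{diag}(C^{-1}))}{\partial \theta_C^T} \right)^T \frac{\partial \mathrm{vech}(C)}{\partial \theta_C^T}$.

Since both $\partial \mathrm{vech}(2 C^{-1} - \mathrm{diag}(C^{-1})) / \partial \theta_C^T$
and $\partial \mathrm{vech}(C) / \partial \theta_C^T$ are square matrices of
dimension $d(d+1)/2$, they must be invertible if $F_C$ is invertible. By the
assumption (10), $F$ is invertible for $\theta \in \Theta$, and hence, so is
$F_C$. Thus, $\partial \mathrm{vech}(2 C^{-1} - \mathrm{diag}(C^{-1})) / \partial \theta_C^T$
and $\partial \mathrm{vech}(C) / \partial \theta_C^T$ are invertible and the
inverse of $F_C$ is expressed as

(18)  $F_C^{-1} = -2 \left( \frac{\partial \mathrm{vech}(C)}{\partial \theta_C^T} \right)^{-1} \left( \left( \frac{\partial \mathrm{vech}(2 C^{-1} - \mathrm{diag}(C^{-1}))}{\partial \theta_C^T} \right)^T \right)^{-1}$.

Next, we derive each block of the gradient of the log-likelihood
$\ln \pi(x; \theta)$. The log-likelihood function for the normal distribution is
written as

(19)  $\ln \pi(x; \theta) = -\frac{d \ln 2\pi}{2} - \frac{\ln \det C}{2} - \frac{\mathrm{tr}\left( C^{-1} (x - m)(x - m)^T \right)}{2}$.

Then, in light of formula (16) and another formula of matrix differentiation (see
e.g., [16, Chapter 15])

    $\frac{\partial \ln \det C}{\partial \theta_i} = \mathrm{tr}\left( C^{-1} \frac{\partial C}{\partial \theta_i} \right)$,

the partial derivative of (19) with respect to $\theta_i$ can be written in the form

(20)  $\frac{\partial \ln \pi(x; \theta)}{\partial \theta_i} = -\frac{1}{2} \mathrm{tr}\left( \frac{\partial C^{-1}}{\partial \theta_i} \left( (x - m)(x - m)^T - C \right) \right) + \frac{\partial m^T}{\partial \theta_i} C^{-1} (x - m)$.

According to assumption (10), we have

(21)  $\nabla_{\theta_m} \ln \pi(x; \theta) = \frac{\partial \ln \pi(x; \theta)}{\partial \theta_m} = \left( \frac{\partial m}{\partial \theta_m^T} \right)^T C^{-1} (x - m)$.

By rewriting the first term of (20) as

    $-\frac{1}{2} \mathrm{tr}\left( \frac{\partial C^{-1}}{\partial \theta_i} \left( (x - m)(x - m)^T - C \right) \right) = -\frac{1}{2} \left( \frac{\partial \mathrm{vech}(2 C^{-1} - \mathrm{diag}(C^{-1}))}{\partial \theta_i} \right)^T \mathrm{vech}\left( (x - m)(x - m)^T - C \right)$,

we have the block of the gradient corresponding to $\theta_C$ as follows

(22)  $\nabla_{\theta_C} \ln \pi(x; \theta) = \frac{\partial \ln \pi(x; \theta)}{\partial \theta_C} = -\frac{1}{2} \left( \frac{\partial \mathrm{vech}(2 C^{-1} - \mathrm{diag}(C^{-1}))}{\partial \theta_C^T} \right)^T \mathrm{vech}\left( (x - m)(x - m)^T - C \right)$.

Taking the product of (15) and (21) and the product of (18) and (22), we
finally have (12) and (13). This completes the proof. □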

4.2. Theoretical Foundation for the Parameter Update in the CMA-ES. 

Theorem 4.1 is useful to derive the explicit form of the natural gradient
learning algorithm under some parameterization. Consider one of the simplest
parameterizations: $m(\theta) = \theta_m$ and $\mathrm{vech}(C(\theta)) = \theta_C$.
Since $\partial m / \partial \theta_m^T = I$ and
$\partial \mathrm{vech}(C) / \partial \theta_C^T = I$, from (11), (12), and (13),
we have the update rule for natural gradient learning

(23)  $\theta^{t+1} = \theta^t + \eta \sum_{i=1}^{\lambda} \frac{f(x_i)}{\lambda} \begin{bmatrix} x_i - m(\theta^t) \\ \mathrm{vech}\left( (x_i - m(\theta^t))(x_i - m(\theta^t))^T - C(\theta^t) \right) \end{bmatrix}$.

Let $m^t = m(\theta^t)$ and $C^t = C(\theta^t)$. Separating (23) into an
$m$-part and a $C$-part, we have

(24)  $m^{t+1} = m^t + \eta \sum_{i=1}^{\lambda} \frac{f(x_i)}{\lambda} (x_i - m^t)$

(25)  $C^{t+1} = C^t + \eta \sum_{i=1}^{\lambda} \frac{f(x_i)}{\lambda} \left( (x_i - m^t)(x_i - m^t)^T - C^t \right)$.

We notice that the update rules (1) and (2) in the CMA-ES are the same as
(24) and (25) derived from natural gradient learning, except that the CMA-ES
uses ranking-based weights instead of raw fitness values $f(x_i)/\lambda$ and
employs different learning rates for $m$ and $C$. In other words, when using a
common value $\eta_m = \eta_C = \eta$ and assigning $w_{R_i} = f(x_i)/\lambda$
in every iteration, the rank-$\mu$ only CMA-ES updates the distribution
parameters along the sampled natural gradient of the expected fitness.

The coefficients $f(x_i)/\lambda$ in natural gradient learning approximately sum
up to $J(\theta)$, because $\sum_{i=1}^{\lambda} f(x_i)/\lambda$ is a
Monte-Carlo estimate of the expected fitness (5), and they increase as the
expected fitness increases. In contrast, the weights $w_i$ in the CMA-ES are
fixed and sum up to one. Therefore, with the fixed learning rates, the
adjustment for the parameters in the CMA-ES is approximately $1/J(\theta)$ times
as large as that in (24) and (25). Provided that $J(\theta)$ is positive, this
corresponds to the relation between $\tilde{\nabla} J(\theta)$ and
$\tilde{\nabla} \ln J(\theta) = \tilde{\nabla} J(\theta) / J(\theta)$. By
replacing $\tilde{\nabla} J(\theta)$ and $J(\theta)$ with their Monte-Carlo
estimates $\delta(\theta \mid \{x_i\})$ and
$J(\theta \mid \{x_i\}) = \sum_{i=1}^{\lambda} f(x_i)/\lambda$, we have a
sampled natural gradient of the log of expected fitness:

(26)  $\delta_{\ln J}(\theta \mid \{x_i\}) = \sum_{i=1}^{\lambda} \frac{f(x_i)}{\sum_{j=1}^{\lambda} f(x_j)} F^{-1}(\theta) \nabla \ln \pi(x_i; \theta)$.

Then, we obtain the update rules for the $m$- and $C$-parts by replacing
$f(x_i)/\lambda$ in (24) and (25) with $f(x_i) / \sum_{j=1}^{\lambda} f(x_j)$.
We notice a closer relation between the CMA-ES and the natural gradient of the
log of expected fitness: not only are the forms of their learning rules the
same, but the coefficients in natural gradient learning using (26) also share
properties with the commonly used weight setting in the CMA-ES.
However, this algorithm is not invariant under monotone transformations of the
fitness function, whereas the CMA-ES is invariant under such transformations,
and this invariance is an important property of the CMA-ES. Further study of
the coefficients is important future work.

In short, this result provides a theoretical justification for the parameter
update in the rank-$\mu$ only CMA-ES. Since the natural gradient points in the
steepest ascent direction of a function defined on a Riemannian manifold, the
CMA-ES turns out to be based on a steepest ascent method with the sampled
natural gradient of (the log of) the expected fitness on the parameter space,
which is a well-principled approach.



4.3. Remarks. There are some remarks that can be made on the results.

CMA-ES and NES. Now that we have found that the CMA-ES is based on the sampled
natural gradient of the expected fitness, it is clear that the CMA-ES can be
considered a variant of NESs. With the same fitness shaping (mapping raw fitness
values to ranking-based weights), the rank-$\mu$ only CMA-ES can be described in
the framework of NESs. The original NES [28] and efficient NES (eNES) [25, 26]
use Cholesky parameterization: $\mathrm{vech}(A) = \theta_C$, where $A$ is the
(lower triangular) Cholesky factor satisfying $C = A A^T$. Exponential NES
(xNES) [12] employs exponential parameterization $\mathrm{vech}(B) = \theta_C$,
where $C = \exp(B)$, and the CMA-ES parameterizes the distribution by
$\mathrm{vech}(C) = \theta_C$. Although the natural gradient itself is invariant
under the choice of coordinate system, a finite step along the natural gradient
leads to a slightly different learning rule under nonlinear transformation of
the coordinate system, as done in eNES, xNES, and the CMA-ES.
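
Restricted Coordinate System. For some restricted covariance matrix cases, we
can attain the corresponding form of the natural gradient in the same manner as
in the proof of Theorem 4.1. For instance, if $\theta_C$ is a scalar and
$C(\theta) = \sigma(\theta_C) C_0$, where $\sigma$ is a function with
$\sigma(\theta_C) > 0$ for $\theta \in \Theta$, and $C_0$ is fixed, we have

    $F_C^{-1}(\theta) \nabla_{\theta_C} \ln \pi(x \mid \theta) = \left( \frac{\partial \sigma}{\partial \theta_C} \right)^{-1} \left( \frac{(x - m)^T C_0^{-1} (x - m)}{d} - \sigma \right)$.

As another example, if $\theta_C$ is a $d$-dimensional column vector and
$C(\theta)$ is a diagonal matrix whose $i$th diagonal element is
$\sigma_i(\theta)$, where the $\sigma_i$ are functions such that
$\sigma_i(\theta) > 0$ for $\theta \in \Theta$, we have

    $F_C^{-1}(\theta) \nabla_{\theta_C} \ln \pi(x \mid \theta) = \left( \frac{\partial [\sigma_1, \ldots, \sigma_d]^T}{\partial \theta_C^T} \right)^{-1} \left[ (x - m)_1^2 - \sigma_1, \ldots, (x - m)_d^2 - \sigma_d \right]^T$.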

sep-CMA-ES and Restricted Coordinate System. Ros and Hansen [24] proposed a
variant of the CMA-ES, named sep-CMA-ES, in which the covariance matrix is
constrained to be diagonal. The sep-CMA-ES without the rank-one update [15]
updates the diagonal elements $\sigma_j$ of the covariance matrix
$C = \mathrm{diag}(\sigma_1, \ldots, \sigma_d)$ as follows:

    $\sigma_j^{t+1} = \sigma_j^t + \eta_C \sum_{i=1}^{\lambda} w_{R_i} \left( (x_i - m^t)_j^2 - \sigma_j^t \right)$.

This is the same as the covariance update rule derived from natural gradient
learning when using a diagonal parameterization:
$C(\theta) = \mathrm{diag}(\theta_{C,1}, \ldots, \theta_{C,d})$.

Active-CMA-ES and Fitness Baseline. Consider the following equalities

    $E[(f(x) - b) \nabla \ln \pi(x; \theta)] = \nabla E[f(x) - b] = \nabla E[f(x)] - \nabla b$

    $\quad = \nabla E[f(x)] = E[f(x) \nabla \ln \pi(x; \theta)]$.

Thus, subtraction of $b$ from the fitness does not affect the expectation of the
gradient estimate but does affect its variance. This fact is used to reduce the
variance of Monte-Carlo estimates, and $b$ is referred to as a baseline (see
e.g., [11, 23, 26]). The natural gradient view and this fact clarify the
relation between the CMA-ES and active-CMA-ES [17]. Active-CMA-ES was proposed
to reduce covariance adaptation time by actively reducing the elements of the
covariance matrix corresponding to unsuccessful search directions and is
implemented by using weights $w_{R_i}$ that are possibly negative and sum up to
zero, whereas they are nonnegative and sum up to one in the CMA-ES. When the
weights in active-CMA-ES are equal to the weights in the CMA-ES minus some
value, active-CMA-ES and CMA-ES estimate the same natural gradient with and
without a baseline.
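The effect of a baseline can be illustrated with a small Monte-Carlo experiment. The sketch below is only an illustration under assumptions made here: a one-dimensional sampling distribution with $m = 0$ and variance $1$, and a hypothetical linear fitness $f(x) = 10 + x$, for which the derivative of $E[f(x)]$ with respect to $m$ equals one.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)      # samples from N(m, v) with m = 0, v = 1
f = 10.0 + x                          # hypothetical fitness; d/dm E[f(x)] = 1

def per_sample_gradient(f, x, b):
    """(f(x) - b) * d/dm ln pi(x; m, v), evaluated at m = 0, v = 1."""
    return (f - b) * x

g_plain = per_sample_gradient(f, x, 0.0)    # no baseline
g_base = per_sample_gradient(f, x, 10.0)    # baseline b = 10

print(g_plain.mean(), g_plain.var())
print(g_base.mean(), g_base.var())
```

Both sample means estimate the same gradient, while the per-sample variance with the baseline is much smaller in this setting, which is why a baseline leaves the expected update unchanged but makes the estimate less noisy.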

5. Correspondence to the Generalized Expectation Maximization 

In this section, we discuss the learning rates for natural gradient learning for 
expected fitness maximization. We derive the range of learning rates that ensure 
monotonic improvement in the expected fitness if the exact natural gradient is given. 
Then, we validate the setting of learning rates used in the CMA-ES. Finally, we 
discuss the relation to the fitness expectation maximization algorithm [27], which is 
an EM-inspired algorithm for continuous optimization, and provide the information 
geometric interpretation of the CMA-ES. 

5.1. Monotone Improvement in the Expected Fitness. The learning rates in
the CMA-ES are usually fixed during learning. They are small positive constants
when the sample size $\lambda$ is small, and reach values up to one when the
sample size is large. In addition, they are different for the mean vector and
for the covariance matrix. Considering the analogy to natural gradient learning,
such a setting of learning rates is exceptional since the optimal step-size
(learning rate) generally varies with the function and the position, and
different learning rates make the adjustment vector stray from the steepest
ascent direction.

To confirm the validity of such a setting for the learning rates, we derive the
range of learning rates that guarantee a monotonic increase in the expected
fitness. Suppose that $f(x)$ is positive, which holds at least if one defines
the fitness as $\exp(f(x))$ instead of $f(x)$. Then $J(\theta) > 0$ holds and we
can view $q(x; \theta) = f(x) \pi(x; \theta) / J(\theta)$ as a probability
density function on $\mathbb{R}^d$ because $q(x; \theta) > 0$ and
$\int q(x; \theta) \, dx = 1$. To show that a step-by-step improvement in the
expected fitness is guaranteed, we consider the following equality:

      $\ln \frac{J(\theta')}{J(\theta)} = \int q(x; \theta) \ln \frac{J(\theta') f(x) \pi(x; \theta)}{J(\theta) f(x) \pi(x; \theta')} \, dx + \int q(x; \theta) \ln \frac{\pi(x; \theta')}{\pi(x; \theta)} \, dx$

      $\quad = \int q(x; \theta) \ln \frac{q(x; \theta)}{q(x; \theta')} \, dx + \int q(x; \theta) \ln \frac{\pi(x; \theta')}{\pi(x; \theta)} \, dx$

(27)  $\quad = D_{KL}\left( q(x; \theta) \,\|\, q(x; \theta') \right) + Q(\theta, \theta') - Q(\theta, \theta)$,

where $Q(\theta, \theta')$ denotes the negative cross entropy
$-H(q(x; \theta), \pi(x; \theta'))$ of $q(x; \theta)$ and $\pi(x; \theta')$,
defined by

    $Q(\theta, \theta') = -H(q(x; \theta), \pi(x; \theta')) = \int q(x; \theta) \ln \pi(x; \theta') \, dx$,

and $D_{KL}(p_1 \,\|\, p_2)$ represents the Kullback-Leibler (KL) divergence of
$p_2$ from $p_1$, defined by $D_{KL}(p_1 \,\|\, p_2) = H(p_1, p_2) - H(p_1)$.
Here $H(p_1)$ denotes the entropy of $p_1$. Since the KL divergence is always
non-negative, we have the following inequality

(28)  $\ln J(\theta') - \ln J(\theta) \ge Q(\theta, \theta') - Q(\theta, \theta)$

with equality holding if and only if $\theta = \theta'$. Thus, if we can choose
$\theta'$ repeatedly to satisfy $Q(\theta, \theta') > Q(\theta, \theta)$, then
step-by-step progress is guaranteed from (28).
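Inequality (28) can be checked numerically in one dimension. The sketch below is a minimal illustration under assumptions made here: a hypothetical positive fitness $f(x) = \exp(-(x - 1)^2 / 2)$, parameters $\theta = (m, v)$ given as mean and variance, and integrals approximated by a Riemann sum on a fixed grid.

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 24001)   # integration grid
dx = x[1] - x[0]

def log_pdf(x, m, v):
    """Log density of the one-dimensional normal distribution N(m, v)."""
    return -0.5 * np.log(2.0 * np.pi * v) - 0.5 * (x - m) ** 2 / v

def f(x):
    return np.exp(-0.5 * (x - 1.0) ** 2)   # hypothetical positive fitness

def J(m, v):
    """Expected fitness (5), approximated on the grid."""
    return np.sum(f(x) * np.exp(log_pdf(x, m, v))) * dx

def Q(theta, theta_new):
    """Negative cross entropy of q(x; theta) and pi(x; theta_new)."""
    q = f(x) * np.exp(log_pdf(x, *theta)) / J(*theta)
    return np.sum(q * log_pdf(x, *theta_new)) * dx

theta, theta_new = (0.0, 1.0), (0.4, 0.8)
lhs = np.log(J(*theta_new)) - np.log(J(*theta))
rhs = Q(theta, theta_new) - Q(theta, theta)
print(lhs, rhs)
```

For this choice the left-hand side exceeds the right-hand side, and the gap between them is the KL divergence term in (27).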

If the natural gradient is estimated sufficiently well, an infinitesimal step in the 
direction leads to an increase in expected fitness. The following theorem shows 
how long a step we can take along the exact natural gradient so as to guarantee 
improvement in expected fitness. 

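Theorem 5.1. Assume that $J(\theta)$ is differentiable. For $\theta \in \Theta$,
suppose $m(\theta) = \theta_m$ and $\mathrm{vech}(C(\theta)) = \theta_C$, and let

    $\theta'(\eta_m, \eta_C) = \begin{bmatrix} \theta_m + \eta_m \tilde{\nabla}_{\theta_m} J(\theta) \\ \theta_C + \eta_C \tilde{\nabla}_{\theta_C} J(\theta) \end{bmatrix}$.

If $\tilde{\nabla}_{\theta_C} J(\theta) \ne 0$, then the mapping
$\eta_C \mapsto Q(\theta, \theta'(0, \eta_C))$ is strictly increasing in
$\eta_C \in (0, 1/J(\theta))$ and has a local maximum point at
$\eta_C = 1/J(\theta)$. Moreover, if $\tilde{\nabla}_{\theta_m} J(\theta) \ne 0$,
then for any $\eta_C \in [0, 1/J(\theta)]$ the map
$\eta_m \mapsto Q(\theta, \theta'(\eta_m, \eta_C))$ is strictly increasing in
$\eta_m \in (0, 1/J(\theta))$ and has a local maximum point at
$\eta_m = 1/J(\theta)$.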

Note that Theorem 5.1 does not necessarily hold under other types of
parameterization such as Cholesky parameterization or exponential
parameterization. This is because they lead to different trajectories, although
these are considered as discretizations of the same associated ordinary
differential equation. Additionally, note that $\eta_m = \eta_C = 1/J(\theta)$
gives a local maximum point of $Q(\theta, \theta'(\eta_m, \eta_C))$ in $\eta_m$
and $\eta_C$, but $Q(\theta, \theta')$ itself does not have a local maximum
point at $\theta' = \theta'(1/J(\theta), 1/J(\theta))$.
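The statement of Theorem 5.1 can be illustrated numerically in one dimension. The following sketch is an illustration under assumptions made here, not part of the proof: it uses a hypothetical positive fitness $f(x) = \exp(-(x - 1)^2 / 2)$, the parameterization $\theta = (m, v)$ with scalar variance $v$, grid-based integration, and the exact natural gradient obtained from (12) and (13) with identity Jacobians.

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 24001)   # integration grid
dx = x[1] - x[0]

def log_pdf(x, m, v):
    """Log density of the one-dimensional normal distribution N(m, v)."""
    return -0.5 * np.log(2.0 * np.pi * v) - 0.5 * (x - m) ** 2 / v

fx = np.exp(-0.5 * (x - 1.0) ** 2)    # hypothetical positive fitness
m, v = 0.0, 1.0                       # current parameters theta
pi = np.exp(log_pdf(x, m, v))
J = np.sum(fx * pi) * dx              # expected fitness J(theta)
q = fx * pi / J                       # density q(x; theta)

# Exact natural gradient: E[f(x)(x - m)] and E[f(x)((x - m)^2 - v)].
ng_m = np.sum(fx * pi * (x - m)) * dx
ng_v = np.sum(fx * pi * ((x - m) ** 2 - v)) * dx

def Q_step(eta_m, eta_v):
    """Q(theta, theta') after a natural gradient step of sizes (eta_m, eta_v)."""
    return np.sum(q * log_pdf(x, m + eta_m * ng_m, v + eta_v * ng_v)) * dx

etas = np.linspace(0.0, 1.0 / J, 50)
Q_along_v = np.array([Q_step(0.0, e) for e in etas])      # covariance step only
Q_along_m = np.array([Q_step(e, 1.0 / J) for e in etas])  # mean step, eta_C = 1/J
```

In this example both `Q_along_v` and `Q_along_m` increase over the whole range up to $1/J(\theta)$, and taking a covariance step slightly longer than $1/J(\theta)$ lowers $Q$ again, in line with the theorem.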

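Proof. Let $m(\theta)$ and $C(\theta)$ be denoted by $m$ and $C$ respectively, and let $m(\theta'(\eta_m, \eta_C))$
and $C(\theta'(\eta_m, \eta_C))$ be denoted by $m_{\eta_m}$ and $C_{\eta_C}$ respectively. First, we prove the
first half of the theorem. The derivative of $Q(\theta, \theta'(0, \eta_C))$ with respect to $\eta_C$ is

(29) $$\frac{\partial Q(\theta, \theta'(0, \eta_C))}{\partial \eta_C} = \int \frac{f(x)\pi(x; \theta)}{J(\theta)} \tilde{\nabla}_{\theta_C} J(\theta)^T \nabla_{\theta_C} \ln \pi(x; \theta'(0, \eta_C)) \, dx.$$

Since $m_{\eta_m} = m$ for $\eta_m = 0$ and $\mathrm{vech}(C_{\eta_C}) = \theta_C + \eta_C \tilde{\nabla}_{\theta_C} J(\theta) = \mathrm{vech}(C) + \eta_C \tilde{\nabla}_{\theta_C} J(\theta)$, by
taking (13) into account we have

$$\begin{aligned}
\nabla_{\theta_C} \ln \pi(x; \theta'(0, \eta_C))
&= F_C(\theta'(0, \eta_C)) F_C^{-1}(\theta'(0, \eta_C)) \nabla_{\theta_C} \ln \pi(x; \theta'(0, \eta_C)) \\
&= F_C(\theta'(0, \eta_C)) \, \mathrm{vech}((x - m)(x - m)^T - C_{\eta_C}) \\
&= F_C(\theta'(0, \eta_C)) \left( \mathrm{vech}((x - m)(x - m)^T - C) - \eta_C \tilde{\nabla}_{\theta_C} J(\theta) \right) \\
&= F_C(\theta'(0, \eta_C)) \left( F_C^{-1}(\theta) \nabla_{\theta_C} \ln \pi(x; \theta) - \eta_C \tilde{\nabla}_{\theta_C} J(\theta) \right).
\end{aligned}$$

Since $\mathrm{E}[f(x) F_C^{-1}(\theta) \nabla_{\theta_C} \ln \pi(x; \theta)] = F_C^{-1}(\theta) \nabla_{\theta_C} J(\theta) = \tilde{\nabla}_{\theta_C} J(\theta)$, where the
expectation is taken under $\pi(x; \theta)$, the derivative (29) reduces to

(30) $$\frac{\partial Q(\theta, \theta'(0, \eta_C))}{\partial \eta_C} = \left( \frac{1}{J(\theta)} - \eta_C \right) \tilde{\nabla}_{\theta_C} J(\theta)^T F_C(\theta'(0, \eta_C)) \tilde{\nabla}_{\theta_C} J(\theta).$$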

Here, for $\eta_C \in [0, 1/J(\theta)]$,

$$C_{\eta_C} = (1 - \eta_C J(\theta)) C + \eta_C \mathrm{E}[f(x)(x - m)(x - m)^T]$$

is positive definite because $(1 - \eta_C J(\theta)) C$ is non-negative definite and $f(x) > 0$
means that $\mathrm{E}[f(x)(x - m)(x - m)^T]$ is positive definite, and the sum of a non-negative
definite matrix and a positive definite matrix is positive definite. By continuity,
$C_{\eta_C}$ remains positive definite for $\eta_C \in [0, 1/J(\theta) + \epsilon)$ for sufficiently small $\epsilon > 0$.
Hence, the Fisher information matrix $F_C(\theta'(0, \eta_C))$ is also positive definite for
$\eta_C \in [0, 1/J(\theta) + \epsilon)$. Thus, the right-hand side of equation (30) is positive if
$\eta_C \in [0, 1/J(\theta))$, zero if $\eta_C = 1/J(\theta)$, and negative if $\eta_C \in (1/J(\theta), 1/J(\theta) + \epsilon)$.
Consequently, we find that $Q(\theta, \theta'(0, \eta_C))$ is strictly increasing with respect to
$\eta_C \in [0, 1/J(\theta))$ and has a local maximum point at $\eta_C = 1/J(\theta)$, which completes
the proof of the first half.

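Next, we show the last half of the theorem. The derivative of $Q(\theta, \theta'(\eta_m, \eta_C))$
with respect to $\eta_m$ is

$$\begin{aligned}
\frac{\partial Q(\theta, \theta'(\eta_m, \eta_C))}{\partial \eta_m}
&= \tilde{\nabla}_{\theta_m} J(\theta)^T \mathrm{E}[f(x) \nabla_{\theta_m} \ln \pi(x; \theta'(\eta_m, \eta_C))] / J(\theta) \\
&= \tilde{\nabla}_{\theta_m} J(\theta)^T \mathrm{E}[f(x) (C_{\eta_C})^{-1} (x - m_{\eta_m})] / J(\theta) \\
&= \tilde{\nabla}_{\theta_m} J(\theta)^T (C_{\eta_C})^{-1} \mathrm{E}[f(x)(x - m - \eta_m \tilde{\nabla}_{\theta_m} J(\theta))] / J(\theta) \\
&= (1/J(\theta) - \eta_m) \tilde{\nabla}_{\theta_m} J(\theta)^T (C_{\eta_C})^{-1} \tilde{\nabla}_{\theta_m} J(\theta).
\end{aligned}$$

Taking into account that $C_{\eta_C}$ is positive definite for $\eta_C \in [0, 1/J(\theta)]$, it is easy to
verify that $Q(\theta, \theta'(\eta_m, \eta_C))$ is strictly increasing in $\eta_m \in [0, 1/J(\theta)]$ and has its
peak at $\eta_m = 1/J(\theta)$. This completes the proof. □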
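
The statement of Theorem 5.1 can be checked numerically in one dimension, where $\theta_m$ is the mean and $\theta_C$ is the variance. The sketch below is only an illustration under stated assumptions: the Gaussian-shaped fitness, the grid quadrature, and the names `q_after_step` and `ETA_MAX` are hypothetical and do not appear in the paper.

```python
import math

def normal_pdf(x, mean, var):
    """Density of the univariate normal N(mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fitness(x):
    """Hypothetical positive fitness, chosen only for illustration."""
    return math.exp(-0.5 * (x - 2.0) ** 2)

# Uniform grid used as a simple quadrature rule (assumption of this sketch).
STEP = 0.01
GRID = [-15.0 + STEP * i for i in range(3001)]

MEAN, VAR = 0.0, 1.0  # current parameters theta = (theta_m, theta_C)
WEIGHTS = [fitness(x) * normal_pdf(x, MEAN, VAR) * STEP for x in GRID]
J = sum(WEIGHTS)  # expected fitness J(theta)

# Natural gradients in the (mean, variance) parameterization:
# E[f(x)(x - m)] and E[f(x)((x - m)^2 - C)], expectations under pi(x; theta).
NAT_GRAD_M = sum(w * (x - MEAN) for w, x in zip(WEIGHTS, GRID))
NAT_GRAD_C = sum(w * ((x - MEAN) ** 2 - VAR) for w, x in zip(WEIGHTS, GRID))

def q_after_step(eta_m, eta_c):
    """Q(theta, theta'(eta_m, eta_c)) for the step defined in Theorem 5.1."""
    mean_new = MEAN + eta_m * NAT_GRAD_M
    var_new = VAR + eta_c * NAT_GRAD_C
    total = 0.0
    for w, x in zip(WEIGHTS, GRID):
        log_pdf = (-0.5 * math.log(2.0 * math.pi * var_new)
                   - (x - mean_new) ** 2 / (2.0 * var_new))
        total += w * log_pdf
    return total / J

ETA_MAX = 1.0 / J  # upper end of the range of learning rates in the theorem
fractions = [0.0, 0.25, 0.5, 0.75, 1.0]

# First half: vary eta_C with eta_m = 0.
q_along_c = [q_after_step(0.0, r * ETA_MAX) for r in fractions]
# Second half: vary eta_m with eta_C fixed inside [0, 1/J].
q_along_m = [q_after_step(r * ETA_MAX, 0.5 * ETA_MAX) for r in fractions]

# Values slightly beyond 1/J, where Q should start to decrease.
q_beyond_c = q_after_step(0.0, 1.25 * ETA_MAX)
q_beyond_m = q_after_step(1.25 * ETA_MAX, 0.5 * ETA_MAX)

print(q_along_c, q_beyond_c)
print(q_along_m, q_beyond_m)
```

Both lists should increase up to the learning rate $1/J(\theta)$, and the values computed at $1.25/J(\theta)$ should be smaller than those at $1/J(\theta)$.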

To provide an intuitive explanation of this theorem, we first show what happens
at the maximum point. Let $\eta_m = \eta_C = 1/J(\theta^t)$. Then, according to Theorem 4.1
we have

(31) $$m^{t+1} = \int \frac{f(x)\pi(x; \theta^t)}{J(\theta^t)} \, x \, dx,$$

(32) $$C^{t+1} = \int \frac{f(x)\pi(x; \theta^t)}{J(\theta^t)} (x - m^t)(x - m^t)^T \, dx.$$

That is, the past information is forgotten and the next estimates are only de- 
termined by the current information when the learning rates are taken so as to 
maximize the lower bound (28). 
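
A sample-based counterpart of this observation can be sketched as follows. The code assumes Monte Carlo estimates of the natural gradients of the form $\mathrm{E}[f(x)(x - m)]$ and $\mathrm{E}[f(x)((x - m)(x - m)^T - C)]$; the two-dimensional fitness and all variable names are hypothetical. With the learning rates set to the reciprocal of the estimated expected fitness, the updated parameters coincide with the fitness-weighted moments, the covariance being centered at the old mean $m^t$.

```python
import random

def fitness(x):
    """Hypothetical positive fitness on R^2 (assumption of this sketch)."""
    return 1.0 / (1.0 + (x[0] - 1.0) ** 2 + (x[1] + 0.5) ** 2)

rng = random.Random(0)
n = 2000
mean = [0.0, 0.0]               # m^t
cov = [[1.0, 0.0], [0.0, 1.0]]  # C^t (identity, so sampling needs no factorization)
samples = [[rng.gauss(mean[0], 1.0), rng.gauss(mean[1], 1.0)] for _ in range(n)]

values = [fitness(x) for x in samples]
total = sum(values)
j_hat = total / n  # Monte Carlo estimate of J(theta^t)

# Monte Carlo estimates of the natural gradients of J.
grad_m = [sum(f * (x[a] - mean[a]) for f, x in zip(values, samples)) / n
          for a in range(2)]
grad_c = [[sum(f * ((x[a] - mean[a]) * (x[b] - mean[b]) - cov[a][b])
               for f, x in zip(values, samples)) / n
           for b in range(2)] for a in range(2)]

# Step with eta_m = eta_C = 1 / J(theta^t).
eta = 1.0 / j_hat
new_mean = [mean[a] + eta * grad_m[a] for a in range(2)]
new_cov = [[cov[a][b] + eta * grad_c[a][b] for b in range(2)] for a in range(2)]

# Fitness-weighted moments, the sample counterparts of (31) and (32).
weights = [f / total for f in values]
weighted_mean = [sum(w * x[a] for w, x in zip(weights, samples)) for a in range(2)]
weighted_cov = [[sum(w * (x[a] - mean[a]) * (x[b] - mean[b])
                     for w, x in zip(weights, samples))
                 for b in range(2)] for a in range(2)]

print(new_mean, weighted_mean)
print(new_cov, weighted_cov)
```

The old parameters $C^t$ and, in the mean update, $m^t$ cancel out of the result, which is the sense in which the past information is forgotten.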

Now we restate Theorem 5.1. For large $\lambda$ such that the estimates (24) and (25)
approximate the natural gradients sufficiently well, $\eta_m = \eta_C = 1/J(\theta^t)$ seems to
be the best choice. Then, the next estimates become (31) and (32). Therefore, the
theorem says that moving the parameters toward (31) and (32) leads to an increase in
the expected fitness even when we assign different values to the learning rates $\eta_m$ and
$\eta_C$. Fig. 1 illustrates the relation between the natural gradients, the target points,
and $Q(\theta^t, \cdot)$.

Figure 1. The relation between the natural gradient of $J(\theta)$ at $\theta^t$, the target points, and the contour lines (solid gray curves) of $Q(\theta^t, \cdot)$.


5.2. Justification of the Learning Rates in the CMA-ES. Remembering that
$\tilde{\nabla} J(\theta)/J(\theta) = \tilde{\nabla} \ln J(\theta)$ and that the update rules (1) and (2) in the CMA-ES are
more similar to $\tilde{\nabla} \ln J(\theta)$ than to $\tilde{\nabla} J(\theta)/J(\theta)$, as mentioned in Section 4.2,
Theorem 5.1 justifies the constant and different learning rates in the CMA-ES.
When $\lambda$ is large enough, it can be considered appropriate to set the learning rates
to nearly one, because the lower bound (28) of the increment in the log of expected
fitness is then maximized. When $\lambda$ is not large enough, smaller learning rates seem
to be appropriate to avert fluctuation of the parameters due to the large variance of
the natural gradient estimates. Since $\theta_m$ and $\theta_C$ have different sizes and the variances
of the gradient estimates differ between the $m$-part and the $C$-part, it is natural to set the
learning rates to different values.
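
The dependence of the estimation variance on $\lambda$ can be illustrated with a small experiment. The sketch below assumes a univariate normal sampling distribution, a hypothetical Gaussian-shaped fitness, and the normalized weights $f(x_i)/\sum_j f(x_j)$ for the $m$-part of the natural gradient of $\ln J$; it compares the empirical variance of the estimate for a small and a large sample size.

```python
import math
import random

def fitness(x):
    """Hypothetical positive fitness (assumption of this sketch)."""
    return math.exp(-0.5 * (x - 2.0) ** 2)

def estimate_mean_step(rng, lam, mean=0.0, std=1.0):
    """Estimate of the m-part of the natural gradient of ln J from lam samples,
    using the weights f(x_i) / sum_j f(x_j)."""
    xs = [rng.gauss(mean, std) for _ in range(lam)]
    fs = [fitness(x) for x in xs]
    total = sum(fs)
    return sum(f * (x - mean) for f, x in zip(fs, xs)) / total

def variance_of_estimate(lam, trials=200, seed=1):
    """Empirical variance of the estimate over independent repetitions."""
    rng = random.Random(seed)
    estimates = [estimate_mean_step(rng, lam) for _ in range(trials)]
    average = sum(estimates) / trials
    return sum((e - average) ** 2 for e in estimates) / (trials - 1)

var_small = variance_of_estimate(lam=10)
var_large = variance_of_estimate(lam=1000)
print("lambda = 10:  ", var_small)
print("lambda = 1000:", var_large)
```

A step of size one along a high-variance estimate moves the parameters to a correspondingly noisy target, which is consistent with the use of smaller learning rates when $\lambda$ is small.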

5.3. Similarity to the EM-based Algorithm and Information Geometric
Interpretation. From Theorem 5.1, we can view natural gradient learning for
expected fitness maximization as an iterative method for finding the value of $\theta^{t+1}$ that
improves $Q(\theta^t, \theta^{t+1})$ compared to $Q(\theta^t, \theta^t)$. This is similar to fitness expectation
maximization [27], whose framework is inspired by expectation maximization
(EM) algorithms [10]. Here we discuss the relation to the EM-based algorithm to
introduce an information geometric interpretation of the CMA-ES.

EM and EM-based Search Algorithms. In semi-supervised learning scenarios, EM
algorithms seek a maximum-likelihood estimate of the parameters of statistical
models that depend on latent variables by alternating between an expectation
(E) step and a maximization (M) step. The E-step calculates the expectation of
the log-likelihood using the current estimate and the M-step finds the parameter
that maximizes the expectation. In reinforcement learning [9, 19] and continuous
optimization [27] scenarios, EM-based algorithms seek the optimal parameters
that maximize the expected reward or expected fitness by taking into account the
inequality (28). The counterpart of the E-step calculates the expectation $Q(\theta^t, \theta^{t+1})$
of the log-likelihood function $\ln \pi(x; \theta^{t+1})$ under $q(x; \theta^t)$ defined previously. The
counterpart of the M-step finds the $\theta^{t+1}$ value that maximizes $Q(\theta^t, \theta^{t+1})$. The fitness
expectation maximization algorithm constitutes an algorithm similar to the estimation
of multivariate normal algorithm ($\mathrm{EMNA}_{global}$; [21]), which is a variant of
estimation of distribution algorithms (EDA).

Geometric View of the EM-based Algorithm. Let $S_\pi = \{\pi_\theta = \pi(x; \theta) \mid \theta \in \Theta\}$ and
$S_q = \{q_\theta = f(x)\pi(x; \theta)/J(\theta) \mid \theta \in \Theta\}$ be statistical manifolds. Considering the
equality

$$\begin{aligned}
Q(\theta^t, \theta^{t+1}) - Q(\theta^t, \theta^t)
&= -H(q_{\theta^t}, \pi_{\theta^{t+1}}) + H(q_{\theta^t}, \pi_{\theta^t}) \\
&= -H(q_{\theta^t}, \pi_{\theta^{t+1}}) + H(q_{\theta^t}) + H(q_{\theta^t}, \pi_{\theta^t}) - H(q_{\theta^t}) \\
&= D_{KL}(q_{\theta^t} \,\|\, \pi_{\theta^t}) - D_{KL}(q_{\theta^t} \,\|\, \pi_{\theta^{t+1}}), \qquad (33)
\end{aligned}$$

we find that choosing $\theta^{t+1}$ so that it maximizes $Q(\theta^t, \theta^{t+1})$ is equivalent to finding the $\pi_{\theta^{t+1}}$
on $S_\pi$ closest to the current distribution $q_{\theta^t}$ on $S_q$ with respect to KL divergence. Based
on the equality (33) and the information geometric view of EM algorithms [4,22],
we perceive the EM-based algorithm as a repeated projection method between $S_\pi$
and $S_q$, where the projection corresponding to the E-step maps $\pi_{\theta^t}$ to $q_{\theta^t}$ and the
projection corresponding to the M-step finds the $\pi_{\theta^{t+1}} \in S_\pi$ that is nearest to $q_{\theta^t}$
with respect to KL divergence (see Fig. 2).
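
For a univariate normal family, the M-step projection can be sketched numerically: the member of $S_\pi$ closest to $q_{\theta^t}$ in KL divergence is the normal distribution whose mean and variance match those of $q_{\theta^t}$. This moment-matching property is a standard fact about normal families rather than a statement taken from the paper, and the fitness, the grid quadrature, and the function names below are assumptions made for illustration.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log-density of the univariate normal N(mean, var)."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def fitness(x):
    """Hypothetical positive fitness (assumption of this sketch)."""
    return 1.0 / (1.0 + (x - 2.0) ** 2)

# Uniform grid used as a simple quadrature rule (assumption of this sketch).
STEP = 0.01
GRID = [-15.0 + STEP * i for i in range(3001)]

MEAN_T, VAR_T = 0.0, 1.0  # current parameters theta^t
raw = [fitness(x) * math.exp(log_normal_pdf(x, MEAN_T, VAR_T)) * STEP for x in GRID]
J_T = sum(raw)
Q_MASS = [r / J_T for r in raw]  # probability mass of q_{theta^t} per grid cell

def kl_to_normal(mean, var):
    """Grid approximation of D_KL(q_{theta^t} || pi(.; mean, var))."""
    total = 0.0
    for p, x in zip(Q_MASS, GRID):
        total += p * (math.log(p / STEP) - log_normal_pdf(x, mean, var))
    return total

# M-step counterpart: match the mean and the variance of q_{theta^t}.
m_star = sum(p * x for p, x in zip(Q_MASS, GRID))
c_star = sum(p * (x - m_star) ** 2 for p, x in zip(Q_MASS, GRID))
kl_star = kl_to_normal(m_star, c_star)

# Other members of the normal family, including the current distribution.
others = [(m_star + 0.3, c_star), (m_star, 1.5 * c_star),
          (m_star - 0.2, 0.7 * c_star), (MEAN_T, VAR_T)]
kl_others = [kl_to_normal(m, c) for m, c in others]

print("projection:", (m_star, c_star), kl_star)
for params, value in zip(others, kl_others):
    print(params, value)
```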

Information Geometry of the CMA-ES. The EM-based algorithm maximizes
$D_{KL}(q_{\theta^t} \,\|\, \pi_{\theta^t}) - D_{KL}(q_{\theta^t} \,\|\, \pi_{\theta^{t+1}})$ with respect to $\pi_{\theta^{t+1}}$, which is a lower bound on
the improvement in the log of the expected fitness, but the CMA-ES just moves the sampling
distribution to a distribution on $S_\pi$ that is closer (not closest) to the target distribution
$q_{\theta^t}$. This corresponds to generalized EM (GEM) algorithms [10], where the M-step
is replaced with a step that finds a $\theta^{t+1}$ value that only improves the expected
value.

An important property and possibly an advantage of the CMA-ES over the EM-based
algorithm is that the CMA-ES employs the natural gradient of the expected
fitness $J(\cdot)$ itself. According to the equality

$$\ln J(\theta^{t+1}) - \ln J(\theta^t) = D_{KL}(q_{\theta^t} \,\|\, q_{\theta^{t+1}}) + D_{KL}(q_{\theta^t} \,\|\, \pi_{\theta^t}) - D_{KL}(q_{\theta^t} \,\|\, \pi_{\theta^{t+1}}),$$

which is derived from equalities (27) and (33), the improvement in the expected fitness
is determined by both $D_{KL}(q_{\theta^t} \,\|\, q_{\theta^{t+1}})$ and $D_{KL}(q_{\theta^t} \,\|\, \pi_{\theta^t}) - D_{KL}(q_{\theta^t} \,\|\, \pi_{\theta^{t+1}})$.
The CMA-ES moves the sampling distribution along the natural gradient of the expected
fitness, which turns out to bring it closer to the target distribution. It does
not maximize the second quantity, but it also takes the first quantity
into account, whereas the EM-based algorithm maximizes the second quantity but
does not take the first quantity into consideration (see Fig. 3).

Figure 3. Geometric interpretation of the CMA-ES. Dotted gray
curves represent the contour lines of the KL divergence $D_{KL}(q_{\theta^t} \,\|\, \cdot)$
from $q_{\theta^t}$.
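
The decomposition of the improvement in $\ln J$ into the two KL quantities can be verified numerically. The sketch below takes one natural gradient step for a univariate normal distribution with learning rates $0.5/J(\theta^t)$ and evaluates each term on a grid; the fitness, the quadrature, the choice of learning rate, and the variable names are assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Density of the univariate normal N(mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fitness(x):
    """Hypothetical positive fitness (assumption of this sketch)."""
    return 1.0 / (1.0 + (x - 2.0) ** 2)

# Uniform grid used as a simple quadrature rule (assumption of this sketch).
STEP = 0.01
GRID = [-15.0 + STEP * i for i in range(3001)]

def expected_fitness(mean, var):
    """J(theta) for theta = (mean, var)."""
    return sum(fitness(x) * normal_pdf(x, mean, var) for x in GRID) * STEP

def kl(p, r):
    """Grid approximation of D_KL(p || r) for densities tabulated on GRID."""
    return sum(p_i * math.log(p_i / r_i) for p_i, r_i in zip(p, r)) * STEP

mean_t, var_t = 0.0, 1.0  # theta^t
j_t = expected_fitness(mean_t, var_t)

# One natural gradient step with eta_m = eta_C = 0.5 / J(theta^t).
eta = 0.5 / j_t
grad_m = sum(fitness(x) * normal_pdf(x, mean_t, var_t) * (x - mean_t)
             for x in GRID) * STEP
grad_c = sum(fitness(x) * normal_pdf(x, mean_t, var_t) * ((x - mean_t) ** 2 - var_t)
             for x in GRID) * STEP
mean_n, var_n = mean_t + eta * grad_m, var_t + eta * grad_c  # theta^{t+1}
j_n = expected_fitness(mean_n, var_n)

# Sampling distributions and target distributions before and after the step.
pi_t = [normal_pdf(x, mean_t, var_t) for x in GRID]
pi_n = [normal_pdf(x, mean_n, var_n) for x in GRID]
q_t = [fitness(x) * p / j_t for x, p in zip(GRID, pi_t)]
q_n = [fitness(x) * p / j_n for x, p in zip(GRID, pi_n)]

first = kl(q_t, q_n)                      # D_KL(q_t || q_{t+1})
second = kl(q_t, pi_t) - kl(q_t, pi_n)    # right-hand side of (33)
improvement = math.log(j_n) - math.log(j_t)

print("improvement in ln J:", improvement)
print("first quantity:     ", first)
print("second quantity:    ", second)
```

In this example both quantities should be positive, so the improvement in $\ln J$ exceeds the lower bound given by the second quantity alone.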



6. Summary 

We described the analogy between the CMA-ES and natural gradient learning
for maximization of (the log of) the expected fitness in Section 4. If one sets the
weights in (1) and (2) to $f(x_i) / \sum_{j=1}^{\lambda} f(x_j)$ at each iteration, the adjustment of
the parameters in the CMA-ES is equivalent to the estimate of the natural gradient
of the log of expected fitness. In addition, these weights share some properties
with the weights used in practice in the CMA-ES. Next, we investigated the properties
of natural gradient learning in Section 5. We derived the range of learning rates
that guarantees that a step along the exact natural gradient will increase the expected
fitness, and justified the use of different learning rates for each parameter.
By considering the similarity to the EM-based algorithm, we showed that natural
gradient learning with the derived range of learning rates can be considered a
generalized EM-based algorithm. Natural gradient learning finds parameters
such that the sampling distribution $\pi(x; \theta^{t+1})$ better matches the current target
distribution $f(x)\pi(x; \theta^t)/J(\theta^t)$. However, in contrast to the EM-based algorithm,
it does not minimize the divergence between the distributions but takes the other
quantity contained in $J(\theta)$ into consideration. Finally, we provided an information
geometric interpretation of the CMA-ES.

Our results contribute to the theoretical understanding of the CMA-ES and to its
improvement. The natural gradient view together with the EM-like
view will help to construct a convergence (stability) theory of the CMA-ES. The
information geometric view might give some insight into more efficient and effective
parameter updates.

In this paper, we did not treat the evolution paths. As we mentioned in Section 2,
they have a great impact on the performance when $\lambda$ is small. A theoretical
foundation for the evolution paths is desired. In addition, we did not consider
the inaccuracy of the natural gradient estimation. We will analyze the stability of the
CMA-ES in future work. Furthermore, as mentioned in Section 4.2, further
investigation of fitness shaping, i.e., the coefficients in the natural gradient
estimation, is also important future work.